Abstract

In  this  paper  we  describe  a  self-adjusting  algorithm  for  packet  routing,  in 
which  a  reinforcement  learning  module  is  embedded  into  each  node  of  a 
switching network. Only local communication is used to keep accurate statistics
at each node on which routing policies lead to minimal delivery times.
In  simple  experiments  involving  a  36-node,  irregularly  connected  network, 
this  learning  approach  proves  superior  to  a  nonadaptive  algorithm  based  on 
precomputed  shortest  paths. 

1  Introduction


We present an algorithm for routing packets efficiently in an irregularly
connected communication network with unpredictable usage patterns. The
algorithm, related to certain "distributed" packet routing algorithms [4, 5],
must learn a routing policy which balances minimizing the number of "hops"
a packet will take with the possibility of congestion along popular routes.
It  does  this  by  experimenting  with  different  routing  policies  and  gathering 
statistics  about  which  policies  minimize  total  delivery  time. 

Although  in  principle  such  an  approach  could  be  very  expensive  in  terms 
of  learning  time  and  storage  space,  the  algorithm  we  describe  here  is  a  variant 
of a reinforcement learning algorithm called Q-learning [7] which adapts to
changes  in  network  traffic  and  requires  little  more  space  than  that  needed 
to  represent  a  complete  routing  policy.  The  application  of  Q-learning  here 
differs  from  its  traditional  use  in  that  the  network  is  actually  constructed 
from  a  distributed  collection  of  learners,  each  of  which  is  responsible  for 
a  portion  of  the  problem.  This  approach  appears  to  be  very  effective  in 
routing  packets  efficiently  under  high  load. 

The  experiments  in  this  paper  were  carried  out  using  a  discrete  event 
simulator  to  model  the  motion  of  packets  through  a  local  area  network. 
Packets  are  periodically  introduced  into  the  network  at  a  random  node  with  a 
random  destination.  Multiple  packets  at  a  node  are  stored  in  an  unbounded 
FIFO  queue;  however,  we  set  a  limit  on  the  total  number  of  packets  active 
in  the  network  at  a  time,  generally  1000.  In  unit  time,  a  node  takes  the 
top  packet  in  its  queue,  examines  its  destination,  and  chooses  a  neighboring 
node  to  which  to  send  the  packet.  A  packet  sent  directly  to  its  destination 
node  is  removed  from  the  network  immediately. 
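
The queueing discipline just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulator used for the experiments: the names `Packet`, `Network`, `inject`, and `step`, the synchronous time step, and the pluggable `choose_next_hop` function are all hypothetical. Only the unbounded FIFO queues, the cap on active packets, and the immediate removal on delivery come from the text.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    dest: int
    born: int  # time step at which the packet entered the network


class Network:
    """Toy synchronous model: every node forwards one packet per unit time."""

    def __init__(self, neighbors, choose_next_hop, max_active=1000):
        self.neighbors = neighbors              # node -> list of adjacent nodes
        self.choose_next_hop = choose_next_hop  # (node, dest, neighbors) -> neighbor
        self.max_active = max_active            # cap on packets in the network
        self.queues = {n: deque() for n in neighbors}  # unbounded FIFO per node
        self.active = 0
        self.now = 0
        self.delivery_times = []

    def inject(self, rng=random):
        """Introduce a packet at a random node with a random other destination."""
        if self.active >= self.max_active:
            return False
        src, dest = rng.sample(sorted(self.neighbors), 2)
        self.queues[src].append(Packet(dest, self.now))
        self.active += 1
        return True

    def step(self):
        """Advance one unit of time; each packet moves at most one hop."""
        moves = []
        for node, queue in self.queues.items():
            if queue:
                packet = queue.popleft()  # head of the FIFO queue
                nxt = self.choose_next_hop(node, packet.dest, self.neighbors[node])
                moves.append((nxt, packet))
        self.now += 1
        for nxt, packet in moves:
            if nxt == packet.dest:
                # Delivered: removed from the network immediately.
                self.active -= 1
                self.delivery_times.append(self.now - packet.born)
            else:
                self.queues[nxt].append(packet)
```

Collecting the moves before applying them keeps a packet from being forwarded twice within the same time step.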

2  Routing  as  a  Reinforcement  Learning  Task 

A  packet  routing  policy  answers  the  question:  to  which  adjacent  node  should 
the  current  node  send  its  packet  in  order  to  get  it  as  quickly  as  possible  to 
its  eventual  destination?  Since  the  policy’s  performance  is  measured  by 
the  total  time  taken  to  deliver  a  packet,  there  is  no  “training  signal”  for 
directly  evaluating  or  improving  the  policy  until  a  packet  finally  reaches  its 
destination.  However,  using  an  idea  from  the  field  of  reinforcement  learning, 
we  can  update  the  policy  more  quickly  and  using  only  local  information. 

In  our  learning  scheme,  the  policy  is  distributed  throughout  the  network 
as follows: each node keeps an estimate, for every neighbor/destination pair
(y,d), of how long it takes for a packet with destination d to arrive if first
sent to neighbor node y. When a node x is asked to route a packet, it sends
it to that neighbor y which x estimates will have the lowest total delivery
time. Instead of then waiting for the packet to reach d before updating the
policy, x queries y to find out how long y expects the given packet to take to
get to d. Since y is presumably closer to d, its estimate is considered more
accurate and thus can be used to update x's delivery time estimate.

More precisely, let Qx(y,d) be the time that node x estimates it takes to
deliver a packet P bound for node d by way of x's neighbor node y, including
any time that P would have to spend in node x's queue.1 Upon sending P
to y, x immediately gets back y's estimate for the time remaining in the
trip, namely

$$Q_y(\hat{z}, d) = \min_{z \,\in\, \text{neighbors of } y} Q_y(z, d)$$

If the packet spent q units of time in x's queue, then x can revise its estimate
as follows:

$$\Delta Q_x(y, d) = \eta \Bigl( \overbrace{Q_y(\hat{z}, d) + q}^{\text{new estimate}} - \overbrace{Q_x(y, d)}^{\text{old estimate}} \Bigr)$$

where η is a "learning rate" parameter (0.7 in our experiments).
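
As a concrete reading of this scheme, the sketch below stores one node's estimates in nested dictionaries. Node x routes to the neighbor with the lowest estimate, and after sending a packet to y it moves its estimate toward the queueing delay plus the remaining time that y reports. The class name `QRouter`, its method names, the initial estimate, and the convention that a node reports zero remaining time for packets addressed to itself are assumptions made for illustration. Only the update itself and the learning rate of 0.7 come from the text.

```python
class QRouter:
    """Per-node table of delivery-time estimates Q_x(y, d)."""

    def __init__(self, node, neighbors, destinations, eta=0.7, initial=0.0):
        self.node = node
        self.eta = eta  # learning rate
        # q[y][d]: estimated time to deliver a packet bound for d via neighbor y
        self.q = {y: {d: initial for d in destinations} for y in neighbors}

    def best_neighbor(self, dest):
        """Neighbor with the lowest estimated delivery time to dest."""
        return min(self.q, key=lambda y: self.q[y][dest])

    def remaining_time(self, dest):
        """Estimate reported to an upstream node: min over z of Q_y(z, d)."""
        if dest == self.node:
            return 0.0  # assumption: nothing remains once the packet has arrived
        return min(self.q[z][dest] for z in self.q)

    def update(self, y, dest, queue_time, t):
        """Move Q_x(y, d) toward the new estimate queue_time + t."""
        old = self.q[y][dest]
        self.q[y][dest] = old + self.eta * (queue_time + t - old)
        return self.q[y][dest]
```

For example, with an old estimate of 10, a queueing delay of 2, and a reported remaining time of 5, the estimate moves to 10 + 0.7 (2 + 5 - 10) = 7.9.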

In  the  field  of  reinforcement  learning,  the  Q-function  Qx(y,d)  is  often 
approximated  by  a  neural  network  (see  e.g.  [3,  6]);  this  can  allow  the  learner 
to  incorporate  diverse  parameters  of  the  system,  such  as  local  queue  size  and 
time  of  day,  into  its  distance  estimation.  For  the  implementation  described 
here,  however,  we  represented  Q  as  a  table. 


3  Results 

We  tested  our  routing  algorithm  on  a  variety  of  network  topologies,  including 
the  7-hypercube,  a  116-node  LATA  telephone  network,  and  an  irregular 
6x6  grid.  Varying  the  level  of  network  traffic,  we  measured  the  average 
delivery  time  for  packets  in  the  system  after  learning  had  settled  on  a  routing 
policy,  and  compared  these  delivery  times  with  those  given  by  a  conventional 
routing  scheme  based  on  shortest  paths.  The  result  was  that  in  all  cases, 
the  learning  algorithm  was  able  to  sustain  a  higher  level  of  network  traffic 
than  the  non-learning  one. 

1 We denote the function by Q because it corresponds to the "Q-function" used in the
reinforcement  learning  technique  of  Q-learning  [7]. 

Figure 2: Performance under low load and high load

Here we present the results for the irregular grid network (pictured in
Figure  1).  Under  conditions  of  low  load,  the  network  learns  fairly  quickly  to 
route  packets  along  shortest  paths  to  their  destination.2  The  performance 
vs.  time  curve  plotted  in  the  left  part  of  Figure  2  demonstrates  that  our 
routing  algorithm,  using  no  prior  knowledge  of  the  network  topology,  learns 
to  route  about  as  well  as  the  shortest  path  router,  which  performs  optimally 
in  low  load. 
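
The resemblance to Bellman-Ford mentioned in footnote 2 can be made explicit with a short worked equation. As an illustration only, assume a learning rate of 1, so that each update replaces the old estimate outright, and write D_x(d) for node x's best estimate; the symbols D and c are introduced here and do not appear in the paper.

```latex
Q_x(y, d) \leftarrow q + \min_{z \,\in\, \text{neighbors of } y} Q_y(z, d) = q + D_y(d),
\qquad
D_x(d) = \min_{y \,\in\, \text{neighbors of } x} Q_x(y, d)

% Bellman-Ford relaxation, for comparison:
D_x(d) \leftarrow \min_{y \,\in\, \text{neighbors of } x} \bigl[ c(x, y) + D_y(d) \bigr]
```

Under this assumption the measured queueing delay q plays the role of the link cost c(x, y). With the learning rate of 0.7 used in the experiments, each estimate moves only part of the way toward this value.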


Figure 3: Policy summaries: shortest path and learned under high load


As  network  traffic  increases,  however,  the  shortest  path  routing  scheme 
is  far  from  optimal:  it  ignores  bottlenecks  and  soon  floods  the  network 
with  packets.  The  right  part  of  Figure  2  plots  performance  vs.  time  for 
the  two  routing  schemes  under  high  load  conditions:  while  shortest  path  is 
unable  to  tolerate  the  packet  load,  our  algorithm  learns  an  efficient  routing 
policy.  The  reason  for  the  learning  algorithm’s  success  is  apparent  in  the 
“policy  summary  diagrams”  in  Figure  3.  These  diagrams  indicate,  for  each 
node  under  a  given  policy,  how  many  of  the  36  x  35  point-to-point  routes 
go through that node. In the left part of Figure 3, which summarizes the
shortest path routing policy, two nodes in the center of the network (labelled
570 and 573) are on many shortest paths and thus become congested when
network load is high. By contrast, the diagram on the right shows that
our algorithm, under conditions of high load, has learned a policy which
routes some traffic over a longer than necessary path (across the top of the
network) so as to avoid congesting the nodes in the center of the network.

2 In fact, it is interesting to note that the Q-learning update rule is mathematically
very much like the well-known Bellman-Ford shortest paths algorithm [1, 2], except our
path relaxation steps are performed asynchronously.
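
The policy summaries of Figure 3 count, for each node, the point-to-point routes that pass through it. One way to compute such a summary from a fixed policy is sketched below. The function name `policy_summary`, the `next_hop` callback, and the decision to count only a route's intermediate nodes (not its source or destination) are assumptions for illustration; the paper does not spell out these details.

```python
from collections import Counter


def policy_summary(nodes, next_hop):
    """Count, for each node, the point-to-point routes passing through it.

    next_hop(node, dest) returns the neighbor that the policy sends a
    packet to. Assumption: a route "goes through" its intermediate nodes
    only, so the source and destination are not counted.
    """
    counts = Counter({n: 0 for n in nodes})
    for src in nodes:
        for dest in nodes:
            if src == dest:
                continue
            node = next_hop(src, dest)
            hops = 0
            while node != dest:
                counts[node] += 1
                node = next_hop(node, dest)
                hops += 1
                if hops > len(nodes):
                    # A loop-free route cannot be longer than the node count.
                    raise ValueError(f"policy loops on route {src} -> {dest}")
    return dict(counts)
```

For the 36-node grid, the two outer loops cover the 36 x 35 routes mentioned above.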

Our basic result is captured in Figure 4, which compares the performances
of the shortest path policy and learned policy as the network load
increases. Each point represents the median over three trials of the mean
packet delivery time (after learning has settled). When the load is very low,
our learning algorithm routes nearly as efficiently as the shortest path
policy. As load increases, the shortest path policy leads to exploding levels of
network congestion, whereas the learning algorithm continues to route
efficiently. Only after a further significant increase in load does the distributed
learning algorithm, too, result in congestion.

Figure 4: Delivery time at various loads

4  Conclusion


In  this  paper,  we  have  exhibited  a  learning  algorithm  which,  without  having 
to  know  in  advance  the  network  topology  and  traffic  patterns,  and  without 
the  need  for  any  centralized  routing  control  system,  is  able  to  discover  an 
efficient  routing  policy.  Although  the  simulations  described  here  are  not 
fully  realistic  from  the  standpoint  of  actual  telecommunication  networks, 
we  believe  that  reinforcement  learning  algorithms  such  as  this  one  could  be 
effective  for  self-adapting  routing  and  other  practical  tasks. 